ABSTRACT

State Departments of Education are turning to the use of
criterion-referenced, as opposed to norm-referenced, models for
statewide assessment. The underlying assumption in this turn of
events is that results generated by criterion-referenced tests within
the statewide assessment context permit the drawing of valid
inferences about the effectiveness of the educational curricula under
study. The tenability of this assumption is examined in light of
rigorous requirements for test construction and validation. The
extent to which the test construction steps can be followed closely
to yield a content-valid test determines the extent to which the
tests can be justifiably used to evaluate the curricula or programs
under study. In summary, it is to be concluded that the content
validity of measuring instruments must be carefully established in
order to ensure meaningful and defensible decision making. The risk
involved in using an invalid test must be judged in terms of the
costs (psychological, financial, etc.) attendant on making erroneous
decisions in a given situation. (MV)

The paper analyzes the methodologies and results of selected
statewide assessment programs in terms of their implications for
evaluating curricula in related subject-area domains. The purpose is
to determine the scope and limitations in usefulness of statewide
data, given different characteristics of the tests designed to measure
the objectives within a domain. Critical issues related to the
selection of objectives and items for criterion-referenced instruments
and the resulting content validity of the tests are discussed.


Purpose


In response to increasing claims about the merits of
criterion-referenced testing (for example, Popham & Husek, 1969;
Hambleton & Novick, 1973; and Popham, 1976) as well as to the impetus
provided by the National Assessment of Educational Progress (NAEP)
(Finley & Berdie, 1970; Womer, 1970), State Departments of Education
are turning more readily to the use of criterion-referenced, as
opposed to norm-referenced, models for statewide assessment. The
underlying assumption in this turn of events is that results generated
by criterion-referenced tests within the statewide assessment context
permit the drawing of valid inferences about the effectiveness of the
educational curricula under study.

This paper examines the tenability of this assumption in the light
of rigorous requirements for test construction and validation. Some
tentative suggestions for overcoming certain obstacles to this process
are offered, with the ultimate goal of underscoring the degree to
which the usefulness of statewide assessment data depends on the
development of content-valid tests.


A Note on Definitions


While numerous definitions of criterion-referenced tests have been
offered, the common denominator appears to be that such tests are
intentionally constructed so as to yield information on the competence
of individuals relative to specified instructional performance tasks
(for example, Hambleton and Novick, 1973; [illegible], 1972; Glaser &
Nitko, 1971).

The salient distinction raised by some authors (for example, Hambleton
et al., 1975) is that measurements that are interpretable in terms of
specified performance standards need not be analyzed solely to permit
mastery decisions. Given the purpose of the testing program, one of
two analytic approaches may be most suitable: 1) the determination of
a "level of functioning" score for each individual, or what Millman
(1974) has termed the "estimation of domain scores," and 2) the
assignment of individuals to mastery states (e.g., masters and
non-masters) based on a criterion cut-off or threshold score.

Most typically, statewide assessment programs adopt the "level of
functioning" analysis (generally applied in terms of the average
percentage of students answering correctly the items referenced to
each performance objective). The ascription of mastery states has
generally been bypassed because of difficulties in gaining consensus
on specific cut-off scores and because of the problem of adequate
numbers of items per objective, discussed later.
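
The two analytic approaches, and the objective-level summary that
statewide programs usually report, can be sketched in a few lines of
Python. This is a minimal illustration only, not a procedure taken
from any of the cited authors: the function names, the 0/1 response
coding, the sample data, and the 0.75 cut-off are all assumptions made
for the example.

```python
from statistics import mean


def domain_score(responses):
    """Proportion of an objective's items one student answered correctly.

    `responses` is a list of 0/1 values (1 = correct), one per item.
    """
    return sum(responses) / len(responses)


def mastery_state(responses, cutoff):
    """Assign a mastery state by comparing the domain score to a cut-off."""
    return "master" if domain_score(responses) >= cutoff else "non-master"


def objective_percent_correct(students):
    """Average percentage correct across students for one objective."""
    return 100 * mean(domain_score(r) for r in students)


# Hypothetical data: four students, four items referenced to one objective.
students = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]

scores = [domain_score(r) for r in students]         # approach 1
states = [mastery_state(r, 0.75) for r in students]  # approach 2 (assumed cut-off)
percent = objective_percent_correct(students)        # typical statewide report
```

The sketch makes the practical difference visible: the first and third
computations need no cut-off at all, whereas the second cannot be run
until someone has agreed on a value for `cutoff`.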

Since the word "criterion" refers to that "minimal acceptable level
of functioning that an examinee must achieve in order to be assigned to
a mastery state" (Hambleton et al., 1975), it would seem that
"criterion-referenced" is not the best modifier for the majority of
statewide assessment programs. In the cases where mastery ascriptions
are not a focus of the program, "domain-referenced" (Hambleton et al.,
1975) or "objectives-referenced" (Schooley, 1976) would seem more
suitable and less misleading terms. These terms imply more directly
the statewide practice of designing high-priority learning objectives,
each of which specifies (or attempts to specify) a domain of items
from which a sample of items should be selected. However, because
"criterion-referenced testing" is the more familiar term within the
context, the terms are used interchangeably here.


A Note on the Purposes
of Statewide Assessment


It should not be assumed that all statewide assessment programs
embrace only one purpose or, for that matter, even the same set of
purposes. At a general level, all can be said to have at least one
underlying goal, the provision of appropriate information for decision
making, which is the essence of any "good evaluation system" (Schooley
et al., 1976).

Beyond this generic goal, however, the purposes are diverse.
Reinstein (1976) has provided an excellent review of these purposes,
some of which are reproduced here: 1) developing state planning
statements and priorities; 2) determining the extent to which students
in a state have attained the skills, knowledge, and attitudes
reflected in the educational goals of the state; 3) determining if
students are acquiring "survival level skills" or "minimum
competencies"; and 4) allocating state grants-in-aid to alleviate
weaknesses in instructional programs.

Without de-emphasizing Reinstein's (1976) cautions regarding the
implementation of these purposes, it may be seen that many of them
correspond nicely to two of the major uses of criterion-referenced
tests outlined by Millman (1974), including "needs assessment" and
"program evaluation."

Millman's third use, "individualized instruction," only marginally
applies; while some statewide programs include reporting results for
individuals, the time frame usually permits only summative as opposed
to formative evaluations. Millman's fourth use, "teacher improvement
and personnel evaluation," is most often avoided by statewide
assessments, primarily because of the political problems that obtain.


However, as other authors have pointed out (notably Hambleton et al.,
1975), these different uses of criterion-referenced tests do not have
differential implications for the construction of tests. Morgan et al.
(1976) tend to disagree, and specify a set of evaluative criteria for
determining the applicability to a given purpose of a test constructed
in a given way. Nevertheless, those criteria which are most relevant
in the present context (subject area coverage, testing time,
curriculum match, and stability/number of items per objective) seem to
be subsumed in the comments of Hambleton et al. (1975) and will be
discussed in detail later. The point is that defensible construction
and content validation procedures must be observed regardless of the
intended use of the criterion-referenced measurements. If such is not
the case, and the resulting instruments are not content valid, then
decisions based on the data generated are likely to be unjustified at
best and wrong at worst.


The Developmental Process


In their excellent monograph on criterion-referenced testing and
measurement, Hambleton et al. (1975), in close agreement with Fremer
(1974), outline what they consider to be the major "domain-referenced
test construction steps": 1) task analysis, 2) definition of content
domain, 3) generation of referenced test items, 4) item analysis, 5)
item selection, and 6) test reliability and content validity check.


The succeeding discussion focuses heavily on the first two steps,
since these activities are the most problematic within the statewide
context, and treats more briefly steps three through five. The
discussion is directed at the issue of content validity: the
production of criterion-referenced tests that permit valid inferences
about the curriculum under study. The determining factor is the degree
to which the content of the tests may be justified on the basis of a
well-defined content domain and a representative sample of items which
permit generalizations to the domain. It is encouraging to note that
it is the position of Hambleton et al. (1975) that, if the test
development steps are carefully followed, the content validity of the
tests should be guaranteed. This is critically important since the
empirical validation technique suggested by Cronbach (1971) (computing
a correlation coefficient for two parallel criterion-referenced tests
constructed by separate teams on the basis of the same domain
specifications) is financially beyond the means of most statewide
assessment programs.
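
The computation behind that empirical check is simple even though the
duplicate test construction is expensive. The sketch below assumes the
coefficient is an ordinary Pearson product-moment correlation between
students' total scores on the two independently built forms. That
choice, the function name, and the scores themselves are assumptions
made for illustration; they are not drawn from Cronbach (1971).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical total scores for six students on two forms written by
# separate teams from the same domain specifications.
form_a = [12, 15, 9, 18, 14, 7]
form_b = [11, 16, 10, 17, 13, 8]

r = pearson_r(form_a, form_b)
```

A coefficient near 1 would suggest the two teams read the domain
specifications the same way; a low value would point to a domain
definition loose enough to support quite different tests.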


Step 1: Task Analysis

While this term is not commonly used in connection with statewide
assessment, it legitimately refers to the process of defining the
purpose and parameters of the test in terms of the subject area and
domain to be assessed. In general, this process is implemented by an
advisory committee representing a cross-section of the state's
educators, administrators, and consumers of education (perhaps
parents, students, or business people). The subject area to be
assessed is generally mandated at the outset, and it is the task of
the committee to specify in more detail the domain(s) which will
define the scope and depth of the assessment instrument. In the
experience of this author, this activity represents no small task.

The context problem. The major obstacle to the development of
content-valid statewide assessment tests resides in the attempt to
apply the criterion-referenced approach in a context for which it was
not initially developed. The criterion-referenced model was designed
for use in classroom management; that is, in the evaluation of
learning outcomes relative to objectives for a specific curriculum
with identifiable characteristics. Accordingly, it has been noted that
criterion-referenced tests are generally administered before or after
small units of instruction (Hambleton & Novick, 1973) and are most
useful when used in a pretest-teach-posttest mode (Schooley et al.,
1976). Difficulties arise when the model is applied, within the
statewide context, to a diversity of curricula considered as one
comprehensive program purely on the basis of the geographical
boundaries of the state. Developing objectives, or domains, in this
instance is not as straightforward a task as the one undertaken by a
classroom teacher in the articulation of prescribed learning outcomes
for a particular semester or year-long course.

Thus, the context problem stems from the need to treat all local
district programs within a given subject area as comprising one common
curriculum and, therefore, to reflect in the tests being developed the
diverse content of these programs. While, as Reinstein (1976) points
out, the lack of congruence among local programs is "probably within
acceptable limits for convention-based studies" such as reading and
mathematics, incongruence is a real concern in other areas like
science and social studies. In the latter case, a high degree of
latitude is evident in the philosophies brought to bear on the task
analysis, and this causes consternation in the process of attempting
to gain committee consensus. This author has observed the penetration
of varying philosophies into the more convention-based curricular
areas (such as mathematics) as well, and concludes (as does Reinstein,
1976) that it is something of a problem in almost every subject area.
These difficulties are observed in the verbal attempts by committee
members to find some common framework within which to evaluate
differing philosophies in terms of their appropriateness to statewide
assessment efforts. In essence, the committee seeks guidelines for
formulating the content of the test.

One attempt to provide guidelines for the task analysis comes in the
form of instructions to develop a test that reflects "survival skills"
or "minimum competencies" within the subject area. One should be
alerted to the problems to be encountered in attempting to define such
concepts. Some practitioners have a tendency to simplify these
concepts to the point of questionable usefulness by specifying a
purely empirical definition (for example, a minimal competence is a
performance which 90% of the current student population within a grade
level are expected to have mastered). This strictly empirical approach
is beset by theoretical difficulties in that the definition of
"minimal competency" is subject to change from year to year based on
the competencies existing in the population at a given time. While it
is certainly possible and acceptable that a set of minimal
competencies will change in a world that is characterized by changing
demands on individuals, it is counterintuitive that the changes in
definition should be based solely on the changing skill levels of
individuals.

A more useful approach is suggested here: charge the advisory
committee to identify those domains that are reflective of curriculum
content to which, in the committee's view, all students in the grade
at the time of testing have been exposed. This does not imply, as
noted earlier, that some large proportion has mastered the objective,
only that the content has been widely taught. An immediate reaction
may be anticipated: the complaint that, given varying curricular
programs across schools, the approach is difficult, if not impossible,
to implement. This author contends that, if the approach is used to
provide focus to the committee, rather than to rigidly restrict the
test development effort, it can provide a useful means for a "first
pass" delimitation of the domain to be assessed.

Once this "common ground" approach has been used to delineate domains,
the committee may then apply additional guidelines to expand the
coverage. There may be, for example, an interest in identifying
additional domains that represent "ideal" outcomes. This interest is
often a function of the transitional phases within the subject area
(as, for example, in the "metric movement" in mathematics instruction
or the intercurricular-concepts movement in social studies and career
education). In these cases the committee is specifying a domain which
it does not fully expect all students to have encountered, but which
represents an ideal learning outcome. The inclusion of these "ideal"
domains in the set may serve to generate baseline data and/or to set a
policy direction for high-priority curricular or program development
within the state.

It should be pointed out here that a task analysis based on the
"common ground" and/or "ideal" approach does not necessarily ensure
that the complete set of resulting domains will match exactly the
curriculum in every school in the state. In this regard, the approach
may be open to charges of irrelevancy to local needs and goals similar
to those originally levied against norm-referenced tests, which were
discarded in favor of criterion-referenced tests. However, where local
districts are using the results for their own evaluational purposes
and where such "mismatch" occurs, data on irrelevant domains may
simply be ignored by district personnel. Further, if state agencies
are using the results to monitor local performance, emphasis should be
placed on domains relevant to the local situation.

Committee members have been observed to achieve a high degree of
consensus on the task analysis using the above approach (see, for
example, the Connecticut Assessment of Educational Progress in
Mathematics, 1976) in spite of the context within which they must
operate. The outcome of this process generally takes the form of a
topical outline. It is this outline, a list of domain descriptors
(e.g., "Addition") or general behavioral objectives (e.g., "Possesses
numerical skills useful in the world of work"), which forms the basis
of the detailing process in Step 2.


Step 2: Definition of the Content Domain


Defining specifically the content domain of statewide assessment tests
is equivalent to writing (or selecting) behavioral objectives. Given
the time commonly available, it is beyond the means of the committee
to specify content either via item generation rules (Bormuth, 1970;
Hively et al., 1973) or via "amplified" objectives (Popham, 1974). The
goal, then, is to produce a set of objectives, each of which is
explicit enough to define the domain of items which may be
legitimately referenced to it. This is important primarily to the
content validity of the test, and secondarily to the need to specify
clearly to local consumers of the test results the domains assessed.

The first problem is to determine the number of objectives to be
identified or, as Popham (1976) puts it, "How large a chunk of learner
behavior should be assessed by the test?" It is understood that an
explicit objective, by definition, must display a certain degree of
specificity. Given the level of specificity adopted, one or more
(perhaps numerous) objectives may be identified for each of the
domains in the topical outline. And certainly, for each objective
identified, there must be "room" in the test for a "sufficient" number
of items to permit generalization from performance on the item set to
performance on the objective.

Since there are rather severe time limitations on statewide assessment
tests, the committee must deal with the trade-off between the number
of objectives that can be assessed and the number of items per
objective that can be included. Given the task of assessing a subject
area, committee members tend to be highly concerned with subject
coverage and, in spite of time constraints, tend to resist limiting
the knowledge or skills covered by the test. Unfortunately, they tend,
therefore, to reduce the specificity of objectives in order to widen
the domain of (types of) items which may be matched to them. This
practice violates Popham's (1976) rule that the magnitude of
behavior(s) assessed should not sacrifice the test's descriptive
clarity.


It is strongly recommended that the committee be apprised of this
problem and urged to limit the test to a set of high-priority
objectives characterized by sufficient explicitness and specificity.
This does not imply that the objectives finally selected must be so
explicit as to be trivial (see Ebel, [year illegible]), but rather
that they must be narrow enough to focus on a restricted range of item
types. Neither does it imply that the committee does not recognize
excluded domains to be important in the general sense; rather, it
suggests that the domain has been limited to maximize the usefulness
of the test for decision-making purposes.

To guide the committee through this content definition activity, it
is recommended that the parameters of the test be considered at the
outset. That is, the committee must consider the time allocated for
testing and the total number of items which can be administered within
that time. (Given conventional multiple-choice items, one minute per
item is a useful rule of thumb; where items are unusually long, as in
reading comprehension, or where open-ended items are involved, time
per item must be adjusted accordingly.)

The next step is to set a minimum number of items per objective.
Unfortunately, this number usually cannot be one that is ideal in
terms of ensuring test reliability. Hambleton et al. (1975) indicate
that some number less than 25 items per domain is recommended, while
Popham (1976) suggests that, in order to reliably assess a domain, the
number of items should "more than likely be between 10 and 20 than
between one and five." These guidelines would restrict a one-hour
conventional multiple-choice test to the measurement of between three
and six objectives, a situation that most statewide committees would
find difficult to live with. A common practice is to adopt a minimum
of four items per objective (which meets the minimum for stable
reliability estimates set by Schooley et al., 1976).
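
The trade-off between testing time, items per objective, and number of
objectives is simple arithmetic, and a committee may find it helpful
to see it laid out. The following is a minimal sketch: the function
names are hypothetical, and the only inputs taken from the text are
the one-minute-per-item rule of thumb and the 10-to-20 item range
attributed to Popham (1976).

```python
def item_budget(testing_minutes, minutes_per_item=1.0):
    """Number of whole items that fit in the allotted testing time."""
    return int(testing_minutes // minutes_per_item)


def max_objectives(total_items, items_per_objective):
    """Largest number of objectives measurable with a fixed item count each."""
    return total_items // items_per_objective


# One-hour test of conventional multiple-choice items (one minute each).
total = item_budget(60)

# Popham's suggested range of 10 to 20 items per objective.
fewest = max_objectives(total, 20)   # 3 objectives
most = max_objectives(total, 10)     # 6 objectives

# Longer items shrink the budget; 1.5 minutes per item is an assumed
# figure for illustration, not a value given in the text.
reading_total = item_budget(60, minutes_per_item=1.5)
```

Running the same functions with a committee's own time allotment and
minimum item count gives the ceiling on objectives before any
objectives are written.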

If a minimum of four items per objective is established, the test
described above could contain up to 25 objectives, but more likely
(given varying item lengths) would contain something closer to 15
objectives. Recall that the issue is to delimit the number of
objectives which the committee may select for the test. Once this
limit is determined, given the test parameters, the committee can be
guided to write or select objectives that are specific enough to be
measured by a sample of only four items. Clearly, this gauging of the
appropriate level of specificity is, at present, an intuitive process.
In practice, committee members exercising their individual intuitions
achieve a surprising degree of agreement on level of specificity when
the issue is clearly understood. If they are urged to consider the
range in type and number of items which would be subsumed by a given
objective, they are able to identify those objectives which are too
general or broad to be of use.

These suggestions serve only as a practical guide for completing the
most difficult step in the test production process. The author admits
that the suggestions perhaps may be more practically sound than
theoretically sound. However, given the current state of the art of
developing criterion-referenced statewide assessment tests, they may
serve to bring us one step closer to the production of fully valid and
reliable tests.


Step 3: Generation of Referenced Items


Many statewide assessment programs involve the generation of an item
pool for each specified objective through a search for existing
materials. One problem that often arises with this approach is that a
sufficient number of items appropriately matching each objective
cannot be located or obtained. The question remains whether the
perceived dearth of materials is a result of time constraints which
make unfeasible a comprehensive search, or whether extensive materials
are not yet available in the field. This problem is compounded when a
committee has elected to write original and rather uniquely defined
objectives. Often, after reviewing the available items, the committee
is forced to reevaluate its objectives and, perhaps, to rewrite them.


This activity may, in fact, be valuable since it can result in the
refinement of the objectives and focuses attention on the need for
objective/item congruence. What should be avoided is the tendency to
"cling to" the phraseology of the objective and to permit "slight
deviations" in the types of items included in the matching pool. This
tendency to create an item pool where truly none exists defeats the
purpose of criterion-referenced measurement and results in tests that
are limited in content validity and, therefore, usefulness. Where a
sufficient item pool is unavailable, objectives should be redefined or
scrapped.


Where time constraints are not an issue (as, for example, in those
programs which include at least a full-year developmental phase prior
to actual testing), generation of original items tends to follow
conventional guidelines. It is encouraging to note that, in these
cases, item writers are sometimes provided with either amplified
objectives or item prototypes (see, for example, the Ohio Statewide
Student Needs Assessment, 1976-77). Highest productivity is achieved
when the item-writing team has the opportunity to interact with the
objectives-writing team, since objective/item congruence is then
maximized. Schooley et al. (1976) suggest that the item and objective
teams should be one and the same. This is sometimes the case in
statewide assessments (see, for example, the Missouri Statewide
Assessment, 1976), but is not a frequent occurrence. Any of the above
approaches (separate teams, interacting teams, or the combined team
approach) is fully workable, given the production of objectives whose
substance can be clearly agreed upon.

One additional approach, however, is not recommended: the generation
of item pools without a corresponding predetermined set of objectives.
Some statewide committees who use the "available materials" method
have adopted this approach due to perceived difficulty in gaining
consensus on objectives at the outset. Rather, the committee reviews
all of the existing items which can be located, and determines for
each one whether or not it is appropriate for statewide assessment.
Here, the committee members are using an internalized, but
unarticulated, set of standards to identify the item pool. For some
reason, they find it easier to achieve consensus on individual items
than on objectives. Once the item pool is generated, they then return
to the bypassed step of domain specification and write objectives
based on the items identified. This approach is not recommended
because it generally results in a pool of items that cannot be well
justified in terms of objective and curriculum coverage.


Step 4: Item Analysis


This step, the procedure of checking the quality of the items, applies
primarily in cases where original items are produced for the statewide
tests. Where existing items are used, they have been used previously.
Where new items are written for the assessment, this author has noted
the use of many of Hambleton et al.'s (1975) procedures for
determining the extent to which items reflect their respective content
domains. These include content specialist ratings, item difficulty
indices, and item discrimination indices. The only method not observed
is that involving "item change statistics," since statewide
assessments do not involve testing before and after instruction.
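
Two of these indices are easy to compute from field-test data. The
sketch below assumes that item difficulty is the proportion of
students answering the item correctly, and that discrimination is the
difference between that proportion in a high-achieving and a
low-achieving group. The difference form is one common convention and
is an assumption here, as are the function names and the data.

```python
def item_difficulty(responses):
    """Proportion of students answering the item correctly (0/1 coding)."""
    return sum(responses) / len(responses)


def item_discrimination(high_group, low_group):
    """Difference in proportion correct between high and low achievers."""
    return item_difficulty(high_group) - item_difficulty(low_group)


# Hypothetical field-test responses to one item, split by achievement group.
high_group = [1, 1, 1, 0]
low_group = [1, 0, 0, 0]

difficulty = item_difficulty(high_group + low_group)
discrimination = item_discrimination(high_group, low_group)
```

An item that nearly everyone answers correctly, or that low achievers
answer as well as high achievers, would stand out in these two numbers
without any need to scan the raw responses.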

It is encouraging that item analysis techniques are commonly being
used in connection with statewide assessment tests. What is less
encouraging is the fact that these are being implemented without
reference to available statistical procedures. Ratings by content
specialists (for example, on a four-point relevancy scale for each
item), item difficulty indices (for example, the percentage of
students scoring correctly on an item in a field test situation), and
item discrimination indices (for example, the proportions of students
in "high achieving" and "low achieving" groups correctly answering
each item) tend to be evaluated on a visual scanning basis. That is,
for example, if the range in difficulty level across the set of items
referenced to an objective looks too wide, "deviant" items are deleted
from the pool. There is little evident use of statistical procedures
suggested by Hambleton et al. (1975): 1) Cohen's (1960) coefficient
kappa to measure the agreement between ratings of items made by
different content specialists, or 2) Cochran's Q test to determine
whether item difficulties are equal.
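
Neither procedure requires special software. The sketch below
implements the standard textbook formulas for Cohen's kappa (two
raters, categorical ratings) and Cochran's Q (binary responses of the
same students to several items). The function names and the data are
hypothetical, and the code is offered as an illustration of the two
statistics rather than as the procedure of Hambleton et al.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' category ratings.

    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's marginal proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)


def cochrans_q(matrix):
    """Cochran's Q for a students-by-items matrix of 0/1 responses.

    Under the hypothesis of equal item difficulties, Q is approximately
    chi-square distributed with (number of items - 1) degrees of freedom.
    Undefined if every student answers all items alike.
    """
    k = len(matrix[0])
    col_totals = [sum(row[j] for row in matrix) for j in range(k)]
    row_totals = [sum(row) for row in matrix]
    total = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - total ** 2)
    denominator = k * total - sum(r * r for r in row_totals)
    return numerator / denominator


# Two specialists rate eight items on a four-point relevancy scale.
specialist_1 = [4, 4, 3, 2, 4, 1, 3, 4]
specialist_2 = [4, 3, 3, 2, 4, 1, 2, 4]
kappa = cohens_kappa(specialist_1, specialist_2)

# Six students (rows) answer the four items (columns) of one objective.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
q = cochrans_q(responses)
```

Comparing `q` with a chi-square critical value for three degrees of
freedom replaces the judgment that a range of difficulties "looks too
wide" with an explicit decision rule.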

The reliance on visual scanning methods as opposed to statistical
techniques suggests that practitioners have not yet taken advantage of
recent developments in data-analytic procedures for
criterion-referenced tests. This may reflect the traditional gap
between theory and practice but, nevertheless, should be corrected if
statewide assessment tests are to approach an optimal level of
validity.


Step 5: Item Selection


It has been contended that "strong" criterion-referenced
interpretations of test scores are made possible only by a random
selection of items from the domain (Hambleton et al., 1975; Millman,
1974). It is here that statewide assessments encounter the most
difficulty in producing tests that are useful to the decision-making
purpose. In assessments that focus on identifying existing test items,
it is frequently difficult to locate enough matching items for each
objective from which to randomly select. If randomly selecting four
out of five existing items truly qualifies as random selection from
the domain, then the problem is resolved. However, it is unclear
whether such limited randomization permits valid generalizations from
items to the domain. Where larger numbers of matching valid items are
available, it is a straightforward matter to randomly select from the
pool.


In assessments that involve production of original items, the random
selection requirement implies that a greater number of items than will
be used are to be written. While in practice this is often the case,
the number of items produced rarely exceeds the number required for
the test by more than four or five due to the time and expense
involved in item writing. This "overproduction" of items is generally
not intended to permit later random selection, but rather to allow for
the deletion of items that, on the basis of item analysis, do not
prove to be valid indicators of the domain. If useful tests depend on
the random selection process, then increased funding must be made
available to aid the generation of larger numbers of items.


It is important that committee members understand the necessity of
random selection in order that they do not insist on selecting the items
they "like best." The most useful and practical approach is to instruct
committee members to review all available valid items and to identify
those that are "acceptable" for statewide assessment purposes. In theory,
all valid items should be acceptable; however, certain eccentricities
prevent the translation from theory to practice. The review for
acceptability will result in a restriction, usually minor, of the item
pool, but should allow the random selection process to be implemented
without complaint from the committee.
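
The two-stage procedure just described (committee review for
acceptability, followed by random selection) might be sketched as
follows. This is a minimal illustration only: the item records, the
"acceptable" flag, and the function name select_items are hypothetical
and do not come from any assessment program discussed here.

```python
import random


def select_items(pool, n_items, seed=None):
    """Randomly draw n_items from the items the committee marked acceptable.

    `pool` is a list of dicts with hypothetical keys "id" and "acceptable".
    """
    # Stage 1: the committee's review restricts the pool; it does not pick items.
    acceptable = [item for item in pool if item["acceptable"]]
    if len(acceptable) < n_items:
        raise ValueError("too few acceptable items to permit random selection")
    # Stage 2: selection is left entirely to chance (seeded for reproducibility).
    rng = random.Random(seed)
    return rng.sample(acceptable, n_items)


# Ten valid items for one objective; the committee rejected item-3.
pool = [{"id": f"item-{i}", "acceptable": i != 3} for i in range(1, 11)]
selected = select_items(pool, n_items=4, seed=1)
```

Recording the seed allows the draw to be repeated and audited, which may
help in demonstrating to committee members that no item was favored.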


Number of items per objective. Given that the number of items for
each objective must meet the minimum required for reliability, the actual
number selected may be constant across all objectives or, alternatively,
may vary across objectives. Regardless of which alternative is adopted,
the choice should reflect some theoretical rationale, as opposed to merely
[...] the number of items is constant across objectives. This implies that,
in the committee's estimation, the objectives are of equal importance. If
the number of items varies across objectives, it should vary in terms of
the relative importance ascribed to each objective.
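
One way to make such a rationale explicit is to derive the number of
items per objective from importance weights supplied by the committee.
The sketch below is offered under stated assumptions: the weights, the
reliability minimum, and the function name allocate_items are
hypothetical, and largest-remainder rounding is only one of several
reasonable ways to turn proportions into whole items.

```python
def allocate_items(weights, total_items, minimum):
    """Split total_items across objectives in proportion to importance weights.

    Every objective first receives `minimum` items (the reliability floor);
    the remaining items are shared out by weight, with largest-remainder
    rounding so that the counts add up exactly to total_items.
    """
    n_objectives = len(weights)
    if total_items < minimum * n_objectives:
        raise ValueError("test too short to give every objective the minimum")
    spare = total_items - minimum * n_objectives
    total_weight = sum(weights.values())
    shares = {obj: spare * w / total_weight for obj, w in weights.items()}
    allocation = {obj: minimum + int(share) for obj, share in shares.items()}
    # Hand out any items lost to rounding, largest fractional part first.
    leftover = total_items - sum(allocation.values())
    by_remainder = sorted(
        shares, key=lambda obj: shares[obj] - int(shares[obj]), reverse=True
    )
    for obj in by_remainder[:leftover]:
        allocation[obj] += 1
    return allocation


# Equal importance yields a constant number of items per objective.
equal = allocate_items({"A": 1, "B": 1, "C": 1}, total_items=15, minimum=4)
# Objective C judged twice as important as A or B.
unequal = allocate_items({"A": 1, "B": 1, "C": 2}, total_items=20, minimum=4)
```

With equal weights the allocation reduces to the constant-number case,
so the same procedure covers both alternatives.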
Since results are almost uniformly reported in terms of the proportion
of items for each objective answered correctly, a constant number of
items across objectives tends to increase the ease with which results can
be meaningfully interpreted. In a sense, this implies to users of the
results that the proportional results can be given equal weight. Popham's
(1976) comment on "behavioral homogeneity" may be useful here. If all
objectives can be designed to reflect approximately equal amounts of
"required instructional time," then selecting equal numbers of items
across objectives becomes particularly defensible. If this is not
possible, then the differing number of items per objective should reflect
some other criterion the committee is using regarding the relative
importance of the objectives. Further, the number of items per objective
should be reported along with the average scores to increase data
meaningfulness.
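
The reporting recommendation above can be illustrated with a short
sketch that pairs each objective's proportion correct with the number of
items on which it rests. The data layout (one dict of scored responses
per student, plus an item-to-objective map) and the function name
objective_report are assumptions made for illustration only.

```python
def objective_report(responses, item_objective):
    """Report proportion correct and item count for each objective.

    responses: list of dicts mapping item id -> 1 (correct) or 0 (incorrect),
               one dict per student.
    item_objective: dict mapping item id -> objective label.
    """
    # Count how many items were written to each objective.
    items_per_objective = {}
    for objective in item_objective.values():
        items_per_objective[objective] = items_per_objective.get(objective, 0) + 1

    # Tally correct answers and total answers by objective.
    tallies = {}
    for student in responses:
        for item_id, correct in student.items():
            objective = item_objective[item_id]
            n_correct, n_answered = tallies.get(objective, (0, 0))
            tallies[objective] = (n_correct + correct, n_answered + 1)

    return {
        objective: {
            "n_items": items_per_objective[objective],
            "proportion_correct": n_correct / n_answered,
        }
        for objective, (n_correct, n_answered) in tallies.items()
    }


item_objective = {"q1": "O1", "q2": "O1", "q3": "O2"}
responses = [
    {"q1": 1, "q2": 0, "q3": 1},
    {"q1": 1, "q2": 1, "q3": 0},
]
report = objective_report(responses, item_objective)
```

In this small example objective O1 rests on two items and O2 on one, so
a reader can see at once that the two proportions do not carry equal
weight.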


Step 6: Test Reliability and Validity 

The issues involved in determining test reliability are not treated
here since the focus is primarily on the content validity of
criterion-referenced tests. It should be sufficient to note here the
position of Hambleton et al. (1975) that if the foregoing steps in test
construction are followed closely, then the content validity of the tests
should be ensured. The previous discussion has highlighted problem areas
in each step that are encountered in the context of statewide assessment.
Only to the extent that these problems are overcome by appealing to the
guidelines suggested will the resulting tests be content valid and useful
for decision making. It would, of course, be desirable to check the
content validity of the tests through the use of Cronbach's (1971)
techniques of test construction described earlier. However, the procedure
continues to seem beyond the means of most statewide assessments. In the
absence of such validation procedures, following closely the guidelines
for valid test construction becomes all the more important.


Underlying the criterion-referenced test construction procedures
outlined in this paper is the need for allocating sufficient time and
resources to the developmental process. This need, raised by Reinstein
(1976) in connection with criterion-referenced test development at the
local level, is magnified in the context of statewide assessment. In
the press to shift from norm-referenced to criterion-referenced testing
at the state level, the need to make available increased and adequate
time for test development is too often overlooked. The task of selecting
and ordering an existing norm-referenced test is far less awesome than
the task of developing "from scratch" a valid criterion-referenced
instrument. Some states (e.g., Minnesota and Rhode Island) have found
that a two-year timeframe is required to permit the implementation of
valid construction methods, while other states have required that the
process be completed within three to six months. Committees that are
severely restrained in terms of time and resources available cannot be
expected to produce something other than a hastily produced test that
cannot be justified in terms of curriculum coverage or content validity.
The extent to which the test construction steps can be followed
closely to yield a content valid test determines the extent to which the
tests can be justifiably used to evaluate the curricula or programs under
study. From a decision-theoretic point of view, the scores of students
on the tests are used to make decisions about the performance status of
individuals and/or the effectiveness of statewide curricular programs. If
the content of the test does not adequately reflect the behaviors
legitimately inferrable from those delimited by the criteria (Popham &
Husek, 1969), then decisions based on the results are likely to be
erroneous.


Teachers have every right to expect that, if evaluations of the
learning outcomes of their pupils are to be made, the evaluations must
be made on the basis of inferences that are well-founded. Teachers as
well as other local consumers of statewide test results are very
sensitive to the content validity of the tests. They frequently make
accurate judgments as to the importance of the objectives selected and
the "goodness-of-fit" of the items referenced to each objective. Where
the tests are weak in these respects, results tend to be disregarded for
local purposes and the evaluations or recommendations made by state
agencies on the basis of test results tend to be ignored. While students
have an equal right to be evaluated on the basis of sound instruments,
they rarely have an opportunity to reject the conclusions drawn on the
basis of their test scores.

In summary, it is to be concluded that the content validity of
measuring instruments must be carefully established in order to ensure
meaningful and defensible decision making. The risk involved in using an
invalid test must be judged in terms of the costs (psychological,
financial, etc.) attendant on making erroneous decisions in a given
situation.